BasicElectionForensics()
The BasicElectionForensics function is an R tool that performs
statistical analyses of election data to detect potential
irregularities or fraud patterns. It implements multiple forensic
methods commonly used in election integrity research and provides
both statistical results and significance testing through
nonparametric bootstrap methods.
BasicElectionForensics(data, Candidates, Level="National",
                       TotalReg, TotalVotes, Methods, R=1000, cores=2)
Output
| Output | Type | Description |
| table | data.frame | Numerical results of the election forensic tests |
| tex | xtable, data.frame | LaTeX-formatted results table suitable for academic use |
| html | datatables, htmlwidget | Interactive HTML table with color-coded significance |
| sigMatrix | matrix, array | Binary significance matrix for further statistical analysis |
Statistical Distribution Tests
- _2BL: Second-digit mean test (Benford’s Law analysis)
- LastC: Last-digit mean test for uniform distribution
- P05s: Percentage last-digit 0/5 indicator (for percentages)
- C05s: Count last-digit 0/5 indicator (for vote counts)
- Skew: Skewness measure (asymmetry from the normal distribution)
- Kurt: Kurtosis measure (tail heaviness compared to the normal distribution)
- DipT: Unimodality test (Hartigan’s dip test)
- Sobyanin: Sobyanin–Sukhovolsky measure (turnout–vote share relationship)
- Correlation: Correlation coefficient between turnout and vote share
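The digit-based statistics above are simple to compute by hand. The sketch below does so in base R on a made-up vector of vote counts (`counts` is illustrative, not package data, and the helper names are hypothetical):

```r
# Hypothetical precinct-level vote counts for illustration
counts <- c(1342, 2875, 910, 4405, 1220, 3051, 785, 2640)

# _2BL: mean of the second digit; under Benford's law this is ~4.187
second_digit <- function(x) as.integer(substr(as.character(x), 2, 2))
mean(second_digit(counts))

# LastC: mean of the last digit; under uniformity the expectation is 4.5
last_digit <- counts %% 10
mean(last_digit)

# C05s: share of counts ending in 0 or 5; the no-fraud expectation is 0.2
mean(last_digit %in% c(0, 5))
```

On real data these statistics are computed over thousands of precincts, and BasicElectionForensics attaches bootstrap significance to each.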
Key Features
- Uses nonparametric bootstrap with a configurable number of simulations (R parameter)
- Provides confidence intervals for statistical significance
- Supports parallel processing for improved performance
- Works with various election data formats
- Handles missing data appropriately
- Supports multi-level analysis (e.g., national, regional, local)
Usage Example
library(EFToolkit)
# Load election data
dat <- read.csv(system.file("extdata/Albania2013.csv", package="EFToolkit"))
# Run forensics analysis
results <- BasicElectionForensics(
  dat,
  Candidates = c("C035", "C050"),
  Level = "Prefectures",
  TotalReg = "Registered",
  TotalVotes = "Ballots",
  Methods = c("P05s", "C05s", "_2BL", "Sobyanin",
              "DipT", "Skew", "Kurt", "Correlation"),
  cores = 2,
  R = 100  # Reduced for faster computation in example
)
# View results
print(results$table)
Basic Election Forensics Table
Interpretation Guidelines
| Test | No-fraud value | Interpretation |
| Second-digit mean (2BL) | 4.187 | Values close to 4.19 are consistent with Benford’s Law. Systematic deviations (too low or too high) may indicate artificial rounding or human fabrication. |
| Last-digit mean (LastC) | 4.5 | Randomly distributed last digits should average 4.5. Substantial deviations suggest the numbers may not be uniformly random (e.g., preferences for certain digits). |
| Count last-digit 0/5 indicator mean (C05s) | 0.2 | About 20% of values should end in 0 or 5. Excess frequency of 0s or 5s may reflect rounding or strategic reporting. |
| Percentage last-digit 0/5 indicator mean (P05s) | 0.2 | The same logic applies to percentages; deviations from 0.2 may signal manipulation of turnout or result percentages. |
| Skewness (Skew) | 0 | A symmetric distribution has skewness near zero. Positive skew indicates a longer right tail (many low but a few very high values); negative skew the opposite. Significant skewness may indicate anomalous clustering of results. |
| Kurtosis (Kurt) | 3 | A normal distribution has kurtosis of 3. Higher values (>3) indicate peakedness (results too concentrated), while lower values (<3) suggest excessive dispersion. |
| Unimodality test p-value (DipT) | >0.05 | A p-value greater than 0.05 supports unimodality (a single peak). Values below 0.05 indicate multimodality, possibly reflecting a mixture of normal and manipulated results. |
| Sobyanin–Sukhovolsky | near 0 | Captures the degree of association between turnout and vote share. Under normal conditions this relationship should be weak or nonexistent; a strong positive association may indicate manipulated results. |
| Correlation coefficient (Corr) | near 0 | Measures the correlation between turnout and vote share across precincts. Values close to zero are expected in competitive elections; high positive correlations suggest manipulated results. |
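The two turnout-based diagnostics in the last rows can be illustrated on synthetic data. The regression slope below is one common way to express a Sobyanin–Sukhovolsky-style measure, not necessarily the exact formula the package uses; the `turnout` and `share` vectors are simulated with deliberate turnout-share inflation:

```r
set.seed(1)
n <- 200
turnout <- runif(n, 0.4, 0.9)                                  # precinct turnout
share   <- 0.5 + 0.8 * (turnout - 0.65) + rnorm(n, sd = 0.05)  # share rises with turnout

# Correlation between turnout and vote share (expected near 0 in clean data)
r <- cor(turnout, share)

# Slope of vote share on turnout, a Sobyanin-Sukhovolsky-style indicator
slope <- unname(coef(lm(share ~ turnout))["turnout"])
```

Because the synthetic data build in a positive turnout effect, both `r` and `slope` come out strongly positive here; in a clean election both should hover near zero.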
NonparamElectionForensics()
The NonparamElectionForensics function implements a
revised version of Shpilkin’s method for detecting election fraud
through nonparametric analysis of vote distributions. This method
identifies anomalous voting patterns by analyzing the relationship
between turnout and candidate support, detecting artificial vote
inflation through statistical modeling of “clean” electoral
behavior.
NonparamElectionForensics(data, Candidates, CandidatesText=NULL,
                          MainCandidate, TotalReg, TotalVotes=NULL,
                          Level=NULL, MaxThreshold=0.8,
                          FigureName, setcolors=NULL,
                          precinctLevel=TRUE, computeSD=NULL,
                          sims=10,
                          mode_search=list(npeaks=5, sortstr=TRUE,
                                           minpeakdistance=1, pick_by="height"),
                          man_turnout=NULL, grid_type="1D")
Output
| Output | Type | Description |
| list_graphs | list | Collection of generated plots (ggplot2/plotly objects) |
| base_stats | list | Basic fraud statistics for the whole dataset |
| sim_all_stats | list | Simulation statistics for the whole dataset (if computeSD is specified) |
| sim_hetero_stats_base | data.frame | Base statistics for regional analyses (if Level != "National") |
| sim_hetero_stats_sims | data.frame | Simulation statistics for regional analyses (if Level != "National") |
| fraud_precinct_data | data.frame | Precinct-level fraud estimates with uncertainty measures |
| data | data.frame | Original input data with computed variables |
| Level | character | Analysis level used in the function |
| creationdate | POSIXct | Timestamp of when the output was created |
Precinct-Level Fraud Data Structure
When precinctLevel=TRUE, the fraud_precinct_data component contains:

| Column | Description |
| id | Unique precinct identifier |
| base.fraud.votes | Point estimate of fraudulent votes |
| sim.precinct_mean | Mean fraud estimate from simulations |
| sim.precinct_sd | Standard deviation of fraud estimates |
| sim.sig_all | Statistical significance indicator |
| precinct_mean_hetero | Regional-level fraud estimates (when Level != "National") |
Primary Outputs
- Official Turnout: Reported voter participation rate
- Real Turnout: Estimated legitimate turnout (clean peak)
- Official Support: Reported candidate vote share
- Real Support: Estimated legitimate support in clean regions
- Ballot Stuffing: Votes added through turnout inflation
- Ballot Switching: Votes transferred between candidates
- Total Fraud: Combined fraudulent votes
- Proportional Fraud: Fraud as a percentage of total votes
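The last two quantities follow from the middle ones by simple arithmetic. The sketch below reproduces them from the base_stats values printed in Example 1 below; the valid-vote denominator is a rough hypothetical figure, since it is not printed directly:

```r
# Ballot stuffing and switching, from Example 1's base_stats (Russia 2000)
ballot_stuffing  <- 1258868
ballot_switching <- 1426378

# Total fraud is the sum of the two mechanisms
total_fraud <- ballot_stuffing + ballot_switching  # 2685246, matching base_stats

# Proportional fraud divides by total valid votes; ~38.4 million is an
# illustrative denominator implied by prop_fraud = 0.0699 in the example
total_valid <- 38.4e6
prop_fraud  <- total_fraud / total_valid
```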
Uncertainty Quantification
When computeSD is specified, the function provides:

- Parametric: Assumes binomial distributions for vote generation
- Nonparametric: Uses bootstrap resampling for robust estimates
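The nonparametric option amounts to resampling precincts with replacement. A minimal stand-alone sketch, using a toy statistic rather than the package's internal fraud estimator:

```r
set.seed(42)
votes <- rpois(500, lambda = 800)                     # synthetic precinct totals
stat  <- function(x) mean(x %% 10 %in% c(0, 5))       # stand-in statistic (0/5 last digits)

# Bootstrap: recompute the statistic on resampled precincts to get a standard error
boot_stats <- replicate(1000, stat(sample(votes, replace = TRUE)))
c(estimate = stat(votes), boot_sd = sd(boot_stats))
```

The bootstrap standard deviation is what drives the significance flags (e.g., sim.sig_all-style indicators) attached to the point estimates.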
Clean Peak Detection Methods
- “height”: Selects the clean peak with the highest vote count
- “area”: Chooses the clean peak with the largest area under the curve
- “cluster”: Uses clustering to identify the clean peak
- “quantile”: Employs mixture models on the turnout distribution for clean peak detection
- “elipse”: Uses robust covariance estimation for clean peak detection
Grid Types
- 1D Grid: Uses a 1D grid over the turnout distribution to identify the clean peak
- 2D Grid: Uses a 2D grid over the joint distribution of turnout and the incumbent’s vote share
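The 1D “height” rule can be approximated in base R with a kernel density estimate and a local-maximum search. This is only a self-contained stand-in (the package's mode_search options suggest a findpeaks-style routine), run on synthetic turnout data with a clean mode near 0.55 and a suspicious spike near 0.95:

```r
set.seed(7)
# Synthetic turnout: a broad clean peak plus a narrow high-turnout anomaly
turnout <- c(rnorm(800, 0.55, 0.07), rnorm(200, 0.95, 0.02))
d <- density(turnout, from = 0, to = 1, n = 512)

# Local maxima of the density grid (a minimal stand-in for findpeaks)
y   <- d$y
mid <- 2:(length(y) - 1)
is_peak <- y[mid] > y[mid - 1] & y[mid] > y[mid + 1]
peaks_x <- d$x[mid][is_peak]
peaks_h <- y[mid][is_peak]

# "height" rule: the tallest mode is taken as the clean peak
clean_peak <- peaks_x[which.max(peaks_h)]
```

Here the tallest mode sits near 0.55, so the anomalous high-turnout mass is excluded from the “clean” behavior used to model legitimate support.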
Example 1: National-Level Analysis with 1D Grid
library(EFToolkit)
# Load Russian 2000 election data
dat <- read.csv("electionfraud2000.csv")
# National analysis using 1D estimation method
res1 <- NonparamElectionForensics(dat,
  Candidates = paste("P", 1:12, sep=""),
  CandidatesText = c("Stanislav Govorukhin", "Umar Dzhabrailov",
                     "Vladimir Zhirinovsky", "Gennady Zuganov",
                     "Ella Pamfilova", "Alexei Podberezkin",
                     "Vladimir Putin", "Yuri Skuratov",
                     "Konstantin Titov", "Aman Tuleev",
                     "Grigorii Yavlinsky", "Against All"),
  MainCandidate = "P7",
  TotalReg = "NVoters",
  TotalVotes = "NValid",
  Level = "National",
  MaxThreshold = 0.8,
  mode_search = list(npeaks = 5, sortstr = TRUE,
                     minpeakdistance = 1, pick_by = "height"),
  FigureName = "Russian Presidential Elections, 2000",
  setcolors = c("royalblue2", "springgreen1", "blue",
                "red", "green", "brown2",
                "darkgreen", "yellow", "lawngreen",
                "purple", "chartreuse1", "orange"),
  precinctLevel = TRUE,
  computeSD = "nonparametric",
  sims = 10,
  grid_type = "1D")
# Summary of precinct-level fraud estimates
total_fraud <- sum(res1$fraud_precinct_data$base.fraud.votes, na.rm=TRUE)
# Result: 2,685,246 fraudulent votes detected
# Statistically significant fraud only
significant_fraud <- sum(res1$fraud_precinct_data$sim.precinct_mean[
res1$fraud_precinct_data$sim.sig_all==TRUE], na.rm=TRUE)
# Result: 811,501 significant fraudulent votes
> res1$base_stats
$`Whole dataset`
official_turnout     real_turnout official_support     real_support
    6.820000e+01     6.700000e+01     5.330000e+01     5.200000e+01
 ballot_stuffing ballot_switching      total_fraud       prop_fraud
    1.258868e+06     1.426378e+06     2.685246e+06     6.990511e-02
# Display the table of region-level measures
View(round(res1$sim_hetero_stats_base, 3))
Precinct-level results
Example 2: Regional-Level Analysis with 1D Grid
# Regional analysis across all federal subjects
res2 <- NonparamElectionForensics(dat,
  Candidates = paste("P", 1:12, sep=""),
  CandidatesText = c("Stanislav Govorukhin", "Umar Dzhabrailov",
                     "Vladimir Zhirinovsky", "Gennady Zuganov",
                     "Ella Pamfilova", "Alexei Podberezkin",
                     "Vladimir Putin", "Yuri Skuratov",
                     "Konstantin Titov", "Aman Tuleev",
                     "Grigorii Yavlinsky", "Against All"),
  MainCandidate = "P7",
  TotalReg = "NVoters",
  TotalVotes = "NValid",
  Level = "regname",  # Regional analysis
  MaxThreshold = 0.8,
  mode_search = list(npeaks = 5, sortstr = TRUE,
                     minpeakdistance = 1, pick_by = "height"),
  FigureName = "Russian Presidential Elections, 2000",
  setcolors = c("royalblue2", "springgreen1", "blue",
                "red", "green", "brown2",
                "darkgreen", "yellow", "lawngreen",
                "purple", "chartreuse1", "orange"),
  precinctLevel = TRUE,
  computeSD = "nonparametric",
  sims = 10,
  grid_type = "1D")
# Regional fraud estimates
regional_fraud <- sum(res2$fraud_precinct_data$precinct_mean_hetero, na.rm=TRUE)
# Result: 1,693,742 fraudulent votes across regions
# Statistically significant regional fraud
significant_regional_fraud <- sum(res2$fraud_precinct_data$precinct_mean_hetero[
res2$fraud_precinct_data$sim.sig_all==TRUE], na.rm=TRUE)
# Result: 1,125,151 significant fraudulent votes
Region-level results
Example 3: National Analysis with 2D Grid
# 2D analysis using parametric uncertainty estimation
res3 <- NonparamElectionForensics(dat,
  Candidates = paste("P", 1:12, sep=""),
  CandidatesText = c("Stanislav Govorukhin", "Umar Dzhabrailov",
                     "Vladimir Zhirinovsky", "Gennady Zuganov",
                     "Ella Pamfilova", "Alexei Podberezkin",
                     "Vladimir Putin", "Yuri Skuratov",
                     "Konstantin Titov", "Aman Tuleev",
                     "Grigorii Yavlinsky", "Against All"),
  MainCandidate = "P7",
  TotalReg = "NVoters",
  TotalVotes = "NValid",
  Level = "National",
  MaxThreshold = 0.8,
  mode_search = list(npeaks = 5, sortstr = TRUE,
                     minpeakdistance = 1, pick_by = "height"),
  FigureName = "Russian Presidential Elections, 2000",
  setcolors = c("royalblue2", "springgreen1", "blue",
                "red", "green", "brown2",
                "darkgreen", "yellow", "lawngreen",
                "purple", "chartreuse1", "orange"),
  precinctLevel = TRUE,
  computeSD = "parametric",
  sims = 10,
  grid_type = "2D")
# 2D method fraud estimates
fraud_2d <- sum(res3$fraud_precinct_data$sim.precinct_mean, na.rm=TRUE)
# Result: -15,724,742 (negative values suggest model limitations)
# Significant fraud in 2D analysis
significant_2d_fraud <- sum(res3$fraud_precinct_data$sim.precinct_mean[
res3$fraud_precinct_data$sim.sig_all==TRUE], na.rm=TRUE)
# Result: 0 (no significant fraud detected with 2D method)
# Display the table of region-level measures
View(round(res3$sim_hetero_stats_base, 3))
Example 4: Analysis of Selected Regions with 2D Grid
# Focus on specific regions of interest
selected_regions <- c("Respublika Dagestan", "Gorod Moskva",
"Samarskaya Oblast`", "Volgogradskaya Oblast`")
dat_subset <- dat[dat$regname %in% selected_regions,]
res4 <- NonparamElectionForensics(dat_subset,
  Candidates = paste("P", 1:12, sep=""),
  CandidatesText = c("Stanislav Govorukhin", "Umar Dzhabrailov",
                     "Vladimir Zhirinovsky", "Gennady Zuganov",
                     "Ella Pamfilova", "Alexei Podberezkin",
                     "Vladimir Putin", "Yuri Skuratov",
                     "Konstantin Titov", "Aman Tuleev",
                     "Grigorii Yavlinsky", "Against All"),
  MainCandidate = "P7",
  TotalReg = "NVoters",
  TotalVotes = "NValid",
  Level = "regname",
  MaxThreshold = 0.8,
  mode_search = list(npeaks = 5, sortstr = TRUE,
                     minpeakdistance = 1, pick_by = "height"),
  FigureName = "Russian Presidential Elections, 2000",
  setcolors = c("royalblue2", "springgreen1", "blue",
                "red", "green", "brown2",
                "darkgreen", "yellow", "lawngreen",
                "purple", "chartreuse1", "orange"),
  precinctLevel = TRUE,
  computeSD = "nonparametric",
  sims = 10,
  grid_type = "2D")
# Access regional comparison results
print(round(res4$sim_hetero_stats_base, 3))
Precinct-level results
Advanced Features
Multi-Level Analysis
When the Level parameter specifies administrative units, the function:
- Performs analysis for the entire dataset
- Conducts separate analyses for each administrative unit
- Aggregates results across regions
- Provides comparative statistics
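The split-then-aggregate pattern can be emulated with base R grouping. The sketch below is synthetic and unrelated to the package's datasets (dat_toy and its columns are made up for illustration):

```r
set.seed(3)
# Toy data: three administrative units with 100 precincts each
dat_toy <- data.frame(region  = rep(c("A", "B", "C"), each = 100),
                      turnout = runif(300, 0.4, 0.95))

# Whole-dataset statistic plus one per administrative unit
whole     <- mean(dat_toy$turnout)
by_region <- sapply(split(dat_toy$turnout, dat_toy$region), mean)

# Comparative summary: national figure alongside regional figures
c(National = whole, by_region)
```

NonparamElectionForensics applies the same idea, but with the full clean-peak estimation repeated inside each unit rather than a simple mean.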
Robust Peak Detection
The algorithm implements multiple fallback strategies:
1. Primary method specified in pick_by
2. Alternative clustering approaches if the primary method fails
3. Simple peak detection as the ultimate fallback
4. Manual override through the man_turnout parameter
Precinct-Level Estimation
When precinctLevel=TRUE, the function:
- Estimates fraud at the individual precinct level
- Uses post-stratification for statistical adjustment
- Provides significance testing for precinct estimates
- Enables spatial analysis integration
Best Practices
Method Selection
- Start with 1D analysis for initial exploration
- Use regional-level analysis for heterogeneous countries
- Apply nonparametric uncertainty for robust estimates
- Test multiple pick_by methods for sensitivity analysis
Parameter Tuning
- Increase sims for more precise uncertainty estimates
- Adjust MaxThreshold based on country-specific context
- Experiment with mode_search parameters for optimal peak detection
- Use custom setcolors for publication-quality visualizations
Result Validation
- Compare across estimation methods (1D vs. 2D)
- Examine statistical significance alongside magnitude
- Cross-validate with other forensic indicators
- Consider the substantive electoral context
ComputeFiniteMixtureModel()
ComputeFiniteMixtureModel - A legacy
implementation of Walter Mebane’s Finite Mixture Model for electoral
data analysis.
The model uses Bayesian estimation techniques with EM-algorithm-like
iterations to estimate the posterior probabilities of each precinct
belonging to each fraud category.
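The E-step idea (posterior class probabilities per precinct) can be shown with a toy two-component Gaussian mixture. This is a generic EM sketch, not Mebane's actual likelihood; all parameter values below are made up:

```r
set.seed(9)
# Toy turnout data: a "clean" component and a small "extreme fraud" component
x <- c(rnorm(300, 0.55, 0.05), rnorm(60, 0.95, 0.02))

# Current parameter guesses (hypothetical)
pi_ext <- 0.2
mu  <- c(0.55, 0.95)
sdv <- c(0.05, 0.02)

# E-step: posterior probability that each observation is "extreme"
lik_clean    <- (1 - pi_ext) * dnorm(x, mu[1], sdv[1])
lik_extreme  <- pi_ext       * dnorm(x, mu[2], sdv[2])
post_extreme <- lik_extreme / (lik_clean + lik_extreme)

# M-step update of the mixing proportion (the "extreme" fraud share)
pi_new <- mean(post_extreme)
```

With well-separated components, pi_new converges toward the true extreme share (60/360 here); the full FMM iterates such steps over several fraud categories and many more parameters.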
⚠️ Important Note: This function is legacy
code that is no longer actively maintained or supported. It may
have dependencies on outdated packages or contain unoptimized
algorithms.
ComputeFiniteMixtureModel(dat, MainCandidate = "Votes", TotalReg = "NVoters",
                          TotalVotes = "NValid", cores = 2, itstartmax = 1)
Output
Returns a list containing FMM estimates with the following
structure:
list(
FF_null = matrix, # Null model results (estimates and standard deviations)
FFlist_null = list, # Full null model output including posterior probabilities
FF = matrix, # Main model results (estimates and standard deviations)
FFlist = list # Full main model output including posterior probabilities
)
The output matrices contain the following parameters:
| Parameter | Description |
| incremental | Proportion of incremental fraud |
| extreme | Proportion of extreme fraud |
| alpha | Fraud intensity parameter |
| turnout | Turnout rate parameter |
| winprop | Winning proportion parameter |
| sigma | Standard deviation for vote proportions |
| stdAtt | Standard deviation for attendance |
| theta | Convergence test parameter |
| loglik | Log-likelihood value |
| df | Degrees of freedom |
Usage Example
library(EFToolkit)
# Load sample data
dat <- read.csv(system.file("extdata/ruspres2020.csv", package = "EFToolkit"))
dat <- subset(dat, select = c("region", "NVoters", "NValid", "Votes"))
datc <- dat[dat$region == "Volgogradskaya Oblast`", ]
# Run FMM analysis (commented out due to long computation time)
# res <- ComputeFiniteMixtureModel(datc,
#                                  MainCandidate = "Votes",
#                                  TotalReg = "NVoters",
#                                  TotalVotes = "NValid")
Key Features
- Mixture Modeling: Implements a finite mixture model with multiple fraud components
- Parallel Computing: Supports multi-core processing for faster computation
- Robust Estimation: Uses genetic algorithms (rgenoud) for parameter optimization
- Statistical Inference: Provides estimates with standard errors
Limitations & Considerations
- Computational Intensity: The function can be very slow for large datasets
- Legacy Status: No longer actively supported or maintained
- Algorithm Complexity: Implements sophisticated statistical models that may require domain expertise to interpret
- Parameter Sensitivity: Results may be sensitive to starting values and iteration limits